Data Analysis Process¶
Asking Questions (Objectives)¶
- Given the data, ask the questions that the analysis should answer
Data Wrangling¶
- Gathering: collect the data needed to answer your questions
- Assessing: explore the data, check data quality, and identify issues
- Cleaning: fix issues by modifying, replacing, renaming, or removing problematic data
Perform EDA (Exploratory Data Analysis)¶
- Explore Data using statistics and Visuals
- Discover Data Pattern
- Understand Data Distribution
Draw Conclusion¶
- Summarize key findings
Communicate Results¶
- Share the Result
Instagram Dataset Field Description¶
- Below is a description of column fields in the Dataset:
Core Fields¶
Impressions – total number of times the post was seen
From Home – impressions from followers’ home feed
From Hashtags – impressions from hashtags
From Explore – impressions from explore page
From Other – impressions from other sources (shares, profile, etc.)
Questions:¶
Which post got the most impressions?
Which post got the most likes?
Which post got the most saves?
What is the average number of impressions per post?
What is the average number of likes per post?
Which source gives the most impressions (Home, Hashtags, Explore)?
Do posts with more hashtags get more impressions?
Do longer captions get more engagement?
Which posts bring the most profile visits?
Which posts bring the most followers?
What is the correlation between impressions and likes?
What is the correlation between impressions and saves?
What is the total engagement for each post?
Which posts have the highest engagement rate?
Does Explore page help increase impressions?
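Several of these questions reduce to one-line pandas aggregations. The sketch below is only an illustration: it uses three rows of the dataset (the ones displayed by df.head(3)) rather than the full file, and the names sample, source_cols, and top_source are assumptions introduced here.

```python
import pandas as pd

# Three rows copied from the df.head(3) output; not the full dataset.
sample = pd.DataFrame({
    "Impressions": [3920, 5394, 4021],
    "From Home": [2586, 2727, 2085],
    "From Hashtags": [1028, 1838, 1188],
    "From Explore": [619, 1174, 0],
    "From Other": [56, 78, 533],
    "Likes": [162, 224, 131],
})

# Which post got the most impressions? idxmax returns the row label.
top_post = sample["Impressions"].idxmax()

# What is the average number of impressions per post?
mean_impressions = sample["Impressions"].mean()

# Which source gives the most impressions? Sum each source column and rank.
source_cols = ["From Home", "From Hashtags", "From Explore", "From Other"]
source_totals = sample[source_cols].sum().sort_values(ascending=False)
top_source = source_totals.index[0]

print(top_post, mean_impressions, top_source)
```

On the full DataFrame the same calls apply unchanged; only the numbers differ.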
# load needed modules
import pandas as pd
# display all the columns
pd.options.display.max_columns = None
# load the dataset into a DataFrame
df = pd.read_csv(r'C:\Users\go\Downloads\.ipynb_checkpoints\Instgram.csv')
# display first three rows
df.head(3)
| | Impressions | From Home | From Hashtags | From Explore | From Other | Saves | Comments | Shares | Likes | Profile Visits | Follows | Caption | Hashtags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3920 | 2586 | 1028 | 619 | 56 | 98 | 9 | 5 | 162 | 35 | 2 | Here are some of the most important data visua... | #finance #money #business #investing #investme... |
| 1 | 5394 | 2727 | 1838 | 1174 | 78 | 194 | 7 | 14 | 224 | 48 | 10 | Here are some of the best data science project... | #healthcare #health #covid #data #datascience ... |
| 2 | 4021 | 2085 | 1188 | 0 | 533 | 41 | 11 | 1 | 131 | 62 | 12 | Learn how to train a machine learning model an... | #data #datascience #dataanalysis #dataanalytic... |
df['Caption'][0]
'Here are some of the most important data visualizations that every Financial Data Analyst/Scientist should know.'
# print the data shape
df.shape
(119, 13)
- The data contains 119 posts (rows) with 13 features (columns)
# investigate data properties
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Impressions     119 non-null    int64
 1   From Home       119 non-null    int64
 2   From Hashtags   119 non-null    int64
 3   From Explore    119 non-null    int64
 4   From Other      119 non-null    int64
 5   Saves           119 non-null    int64
 6   Comments        119 non-null    int64
 7   Shares          119 non-null    int64
 8   Likes           119 non-null    int64
 9   Profile Visits  119 non-null    int64
 10  Follows         119 non-null    int64
 11  Caption         119 non-null    object
 12  Hashtags        119 non-null    object
dtypes: int64(11), object(2)
memory usage: 12.2+ KB
Completeness: All 13 columns have 119 non-null entries, matching the total number of rows (119), so the dataset has no missing values. This greatly simplifies cleaning.
Data Types: The data types are correctly assigned:
Numerical Data (int64): All engagement and reach metrics (Impressions, From Home, Saves, Comments, Likes, etc.) are correctly read as integers. This is ideal for mathematical aggregation and statistical analysis.
Text Data (object): Caption and Hashtags are correctly identified as object (string) types, which is necessary for text analysis and feature engineering.
df_copy = df.copy()
#check for duplicates
df.duplicated().sum()
np.int64(17)
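- The check flags 17 rows as exact copies of earlier rows. They are not dropped here, so the steps below still work with all 119 rows.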
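When duplicated().sum() returns a nonzero count, the repeated rows can be inspected and, if they are not wanted, dropped before the analysis. A minimal sketch on a toy frame; the values and the names toy, dupe_rows, and deduped are illustrative, not taken from the dataset:

```python
import pandas as pd

# Toy frame: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "Impressions": [100, 200, 100, 300],
    "Likes": [10, 20, 10, 30],
})

# duplicated() marks every repeat of an earlier row (the first copy is not marked).
n_dupes = int(toy.duplicated().sum())

# keep=False marks all members of each duplicate group, which helps inspection.
dupe_rows = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first copy of each group.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(dupe_rows), len(deduped))
```

Whether to drop such rows is a judgment call: identical metrics and text may mean a repeated export, but the notebook does not say.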
# check for missing value
df.isnull().sum()
Impressions       0
From Home         0
From Hashtags     0
From Explore      0
From Other        0
Saves             0
Comments          0
Shares            0
Likes             0
Profile Visits    0
Follows           0
Caption           0
Hashtags          0
dtype: int64
- The data has no missing values
All columns in the file are relevant, so we don't need to drop or rename any column¶
df.dtypes
Impressions        int64
From Home          int64
From Hashtags      int64
From Explore       int64
From Other         int64
Saves              int64
Comments           int64
Shares             int64
Likes              int64
Profile Visits     int64
Follows            int64
Caption           object
Hashtags          object
dtype: object
Feature Engineering¶
- Create new useful columns
- (e.g., engagement rate, growth rate, time-based features)
# engagement rate (%): comments + likes + shares per home-feed impression ('From Home')
df['engagement rate']=(df['Comments']+df['Likes']+df['Shares'])/df['From Home']*100
df.head(1)
| | Impressions | From Home | From Hashtags | From Explore | From Other | Saves | Comments | Shares | Likes | Profile Visits | Follows | Caption | Hashtags | engagement rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3920 | 2586 | 1028 | 619 | 56 | 98 | 9 | 5 | 162 | 35 | 2 | Here are some of the most important data visua... | #finance #money #business #investing #investme... | 6.805878 |
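The rate in this notebook divides interactions by From Home. Dividing by total Impressions instead is another common definition, and it gives a different scale. The sketch below computes both on the three rows shown by df.head(3); the column name 'rate per impression' and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Three rows copied from the df.head(3) output; not the full dataset.
sample = pd.DataFrame({
    "Impressions": [3920, 5394, 4021],
    "From Home": [2586, 2727, 2085],
    "Comments": [9, 7, 11],
    "Shares": [5, 14, 1],
    "Likes": [162, 224, 131],
})

# Total engagement per post, using the same three interaction types as the notebook.
sample["engagement"] = sample["Comments"] + sample["Likes"] + sample["Shares"]

# The notebook's definition: interactions per home-feed impression, in percent.
sample["engagement rate"] = sample["engagement"] / sample["From Home"] * 100

# Alternative (an assumption, not the notebook's choice): per total impression.
sample["rate per impression"] = sample["engagement"] / sample["Impressions"] * 100

print(sample[["engagement", "engagement rate", "rate per impression"]].round(2))
```

Because From Home is always a subset of Impressions, the per-impression rate can never exceed the home-feed rate for the same post.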
q_rate70 = df['engagement rate'].quantile(0.70)
q_rate70
np.float64(8.496093232684412)
- 70% of the posts have an engagement rate below about 8.5%
df['reach class'] = df['engagement rate'].apply(lambda x : "High Rate" if x > q_rate70 else "Low Rate")
df.head(1)
| | Impressions | From Home | From Hashtags | From Explore | From Other | Saves | Comments | Shares | Likes | Profile Visits | Follows | Caption | Hashtags | engagement rate | reach class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3920 | 2586 | 1028 | 619 | 56 | 98 | 9 | 5 | 162 | 35 | 2 | Here are some of the most important data visua... | #finance #money #business #investing #investme... | 6.805878 | Low Rate |
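The 70th-percentile split can be checked on a small series: values strictly above the threshold are labeled High Rate, so roughly the top 30% land in that class. This sketch uses made-up rates from 1 to 10 and np.where as a vectorized alternative to apply with a lambda; the names rates, threshold, and labels are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up engagement rates, one per post.
rates = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

# 70th percentile, using pandas' default linear interpolation.
threshold = rates.quantile(0.70)

# Vectorized labeling: strictly above the threshold is "High Rate".
labels = pd.Series(
    np.where(rates > threshold, "High Rate", "Low Rate"),
    index=rates.index,
)

# Share of posts in the high class.
share_high = (labels == "High Rate").mean()
print(threshold, share_high)
```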
# make sure the 'Hashtags' and 'Caption' columns are strings
df['Hashtags'] = df['Hashtags'].astype(str)
df['Caption'] = df['Caption'].astype(str)
# split the hashtag string into a list of tags (so each tag can be accessed)
df['Hashtags'] = df['Hashtags'].str.split('#')
df.head(1)
| | Impressions | From Home | From Hashtags | From Explore | From Other | Saves | Comments | Shares | Likes | Profile Visits | Follows | Caption | Hashtags | engagement rate | reach class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3920 | 2586 | 1028 | 619 | 56 | 98 | 9 | 5 | 162 | 35 | 2 | Here are some of the most important data visua... | [, finance , money , business , investing , in... | 6.805878 | Low Rate |
df['Hashtags'].dtype
dtype('O')
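Splitting on '#' leaves an empty first element and keeps the separator that follows each tag (the outputs in this notebook show a trailing non-breaking space, \xa0). One way to avoid both, sketched here as an alternative rather than as what the notebook does, is to extract the tags with a regular expression. The strings below are shortened examples, and raw and tags are illustrative names.

```python
import pandas as pd

# Shortened example strings; the first one uses non-breaking spaces (\xa0)
# between tags, as seen in the notebook's outputs.
raw = pd.Series([
    "#finance\xa0#money\xa0#business",
    "#python #data",
])

# Capture the word characters after each '#'. No empty first element is
# produced, and separators are not part of the match.
tags = raw.str.findall(r"#(\w+)")

# One row per (post, tag), then count how often each tag occurs.
tag_counts = tags.explode().value_counts()

print(tags.tolist())
```

With tags extracted this way, the empty-string entry and the \xa0 suffixes do not show up in later counts.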
# explode the hashtag lists (one row per post and hashtag) to analyze them
df_hashtags= df.explode('Hashtags')
df_hashtags.shape
(2375, 15)
df_hashtags.head()
| | Impressions | From Home | From Hashtags | From Explore | From Other | Saves | Comments | Shares | Likes | Profile Visits | Follows | Caption | Hashtags | engagement rate | reach class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3920 | 2586 | 1028 | 619 | 56 | 98 | 9 | 5 | 162 | 35 | 2 | Here are some of the most important data visua... | | 6.805878 | Low Rate |
| 0 | 3920 | 2586 | 1028 | 619 | 56 | 98 | 9 | 5 | 162 | 35 | 2 | Here are some of the most important data visua... | finance | 6.805878 | Low Rate |
| 0 | 3920 | 2586 | 1028 | 619 | 56 | 98 | 9 | 5 | 162 | 35 | 2 | Here are some of the most important data visua... | money | 6.805878 | Low Rate |
| 0 | 3920 | 2586 | 1028 | 619 | 56 | 98 | 9 | 5 | 162 | 35 | 2 | Here are some of the most important data visua... | business | 6.805878 | Low Rate |
| 0 | 3920 | 2586 | 1028 | 619 | 56 | 98 | 9 | 5 | 162 | 35 | 2 | Here are some of the most important data visua... | investing | 6.805878 | Low Rate |
high_rate_posts=df[df['reach class']=='High Rate']
high_rate_posts.head(1)
| | Impressions | From Home | From Hashtags | From Explore | From Other | Saves | Comments | Shares | Likes | Profile Visits | Follows | Caption | Hashtags | engagement rate | reach class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5394 | 2727 | 1838 | 1174 | 78 | 194 | 7 | 14 | 224 | 48 | 10 | Here are some of the best data science project... | [, healthcare , health , covid , data , datasc... | 8.984232 | High Rate |
# statistical measures of High Rate posts
high_rate_posts[['Impressions','From Home','From Hashtags','From Explore','From Other','Follows']].describe()
| | Impressions | From Home | From Hashtags | From Explore | From Other | Follows |
|---|---|---|---|---|---|---|
| count | 36.000000 | 36.000000 | 36.000000 | 36.000000 | 36.000000 | 36.000000 |
| mean | 7407.361111 | 2408.722222 | 3446.972222 | 1283.055556 | 182.083333 | 29.722222 |
| std | 3740.155002 | 590.821056 | 2508.970847 | 2291.808618 | 171.485339 | 47.425497 |
| min | 3630.000000 | 1711.000000 | 621.000000 | 36.000000 | 23.000000 | 0.000000 |
| 25% | 4885.750000 | 2012.750000 | 1923.500000 | 285.500000 | 73.750000 | 6.000000 |
| 50% | 6449.000000 | 2195.000000 | 2351.000000 | 552.500000 | 112.500000 | 13.000000 |
| 75% | 9686.250000 | 2706.750000 | 4172.250000 | 1008.000000 | 227.000000 | 32.500000 |
| max | 17713.000000 | 4137.000000 | 11817.000000 | 12389.000000 | 794.000000 | 260.000000 |
# load needed modules
from collections import Counter
df['Hashtags']
0 [, finance , money , business , investing , in...
1 [, healthcare , health , covid , data , datasc...
2 [, data , datascience , dataanalysis , dataana...
3 [, python , pythonprogramming , pythonprojects...
4 [, datavisualization , datascience , data , da...
...
114 [, datascience , datasciencejobs , datascience...
115 [, machinelearning , machinelearningalgorithms...
116 [, machinelearning , machinelearningalgorithms...
117 [, datascience , datasciencejobs , datascience...
118 [, python , pythonprogramming , pythonprojects...
Name: Hashtags, Length: 119, dtype: object
Hashtags = [y for x in high_rate_posts['Hashtags'] for y in x]
Hashtags_exp=df.explode('Hashtags')
top_5_hashtags = Counter(Hashtags).most_common(5)
# five most frequent hashtags among High Rate posts
top_5_hashtags
[('', 36),
('amankharwal\xa0', 36),
('python\xa0', 34),
('datascience\xa0', 33),
('dataanalytics\xa0', 32)]
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
#relationship between engagement rate and Impressions
sns.scatterplot(data=df,x='Impressions',y='engagement rate')
plt.title('Impressions vs engagement rate ')
plt.xlabel('Impressions')
plt.ylabel('engagement rate')
plt.show()
sns.regplot(data=df,x='Impressions',y='engagement rate')
plt.title('Impressions vs engagement rate ')
plt.xlabel('Impressions')
plt.ylabel('engagement rate')
plt.show()
df.plot(x='engagement rate',y='Impressions' , kind='scatter')
<Axes: xlabel='engagement rate', ylabel='Impressions'>
sns.displot(df['engagement rate'],kde=True)
<seaborn.axisgrid.FacetGrid at 0x25b7047f110>
df=df.explode('Hashtags')
top_15_hashtags = df['Hashtags'].value_counts().head(15)
top_15_hashtags
Hashtags
119
python 109
amankharwal 107
machinelearning 96
pythonprogramming 95
datascience 94
ai 91
artificialintelligence 89
data 88
dataanalytics 87
datascientist 83
pythonprojects 82
pythoncode 78
dataanalysis 77
deeplearning 75
Name: count, dtype: int64
df['Hashtags'].value_counts()[:9].plot(kind='bar')
<Axes: xlabel='Hashtags'>
plt.figure(figsize=(12,5))
sns.barplot(top_15_hashtags)
plt.xticks(rotation = 10)
plt.show()
plt.figure(figsize=(12,5))
sns.barplot(top_15_hashtags,orient='h')
plt.show()
Hash_avg_rate = df.groupby('Hashtags')['engagement rate'].mean().sort_values(ascending=False)
Hash_avg_rate15 = df.groupby('Hashtags')['engagement rate'].mean().sort_values(ascending=False).head(15)
Hash_avg_rate
Hashtags
sql 13.638220
mysql 13.638220
roadmap 11.158650
covid 11.027272
healthcare 11.027272
...
programmingmemes 5.215772
php 5.215772
programmers 5.215772
webdesign 5.215772
facebook 5.129651
Name: engagement rate, Length: 176, dtype: float64
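A mean per hashtag says little when a tag appears in only one or two posts; the identical values for sql and mysql above suggest they always occur together. A sketch of reporting the post count next to the mean and keeping only tags above a minimum count follows. The data is made up, and the column names mean_rate and posts and the cutoff of 2 are assumptions.

```python
import pandas as pd

# Made-up exploded data: one row per (post, hashtag).
exploded = pd.DataFrame({
    "Hashtags": ["sql", "python", "python", "python", "data", "data"],
    "engagement rate": [13.6, 6.0, 8.0, 10.0, 5.0, 7.0],
})

# Mean rate and number of rows per hashtag, via named aggregation.
stats = exploded.groupby("Hashtags")["engagement rate"].agg(
    mean_rate="mean",
    posts="count",
)

# Keep only hashtags seen in at least 2 posts (arbitrary cutoff), best first.
frequent = stats[stats["posts"] >= 2].sort_values("mean_rate", ascending=False)

print(frequent)
```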
plt.figure(figsize=(12,5))
sns.barplot(Hash_avg_rate15)
plt.xticks(rotation = 45)
plt.show()
px.bar(top_15_hashtags)
px.scatter(df,x='Follows',y = 'Impressions',trendline='ols',color='engagement rate')
corr_matrix = df[['engagement rate','Impressions','Follows','Likes']].corr()
corr_matrix
| | engagement rate | Impressions | Follows | Likes |
|---|---|---|---|---|
| engagement rate | 1.000000 | 0.366108 | 0.422574 | 0.595362 |
| Impressions | 0.366108 | 1.000000 | 0.884286 | 0.856445 |
| Follows | 0.422574 | 0.884286 | 1.000000 | 0.736817 |
| Likes | 0.595362 | 0.856445 | 0.736817 | 1.000000 |
sns.heatmap(corr_matrix,annot=True,cmap='coolwarm')
<Axes: >
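Because df was reassigned to the exploded frame earlier, each post contributes one row per hashtag to this correlation, so posts with many hashtags weigh more. A sketch of recovering one row per post from an exploded frame before correlating is shown below; it relies on explode keeping the original index. The hashtag lists are shortened, made-up examples and the variable names are illustrative.

```python
import pandas as pd

# Metrics from the first three rows of the dataset; hashtag lists are made up.
posts = pd.DataFrame({
    "Impressions": [3920, 5394, 4021],
    "Follows": [2, 10, 12],
    "Likes": [162, 224, 131],
    "Hashtags": [["finance", "money"], ["health", "data", "covid"], ["data"]],
})

# explode repeats each post once per hashtag and keeps the original index.
exploded = posts.explode("Hashtags")

# Keep the first row for each original index label: one row per post again.
per_post = exploded[~exploded.index.duplicated(keep="first")]

cols = ["Impressions", "Follows", "Likes"]
corr_per_post = per_post[cols].corr()   # each post counted once
corr_exploded = exploded[cols].corr()   # posts weighted by hashtag count

print(len(exploded), len(per_post))
```

Keeping the exploded data in a separate variable (the notebook already creates df_hashtags) avoids the issue altogether.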
px.pie(df,names='reach class')
px.pie(names = top_15_hashtags.index , values=top_15_hashtags.values ,title='Top15 Hashtags')